We begin with the standard imports:
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
In [ ]:
def make_data(N, f=0.3, rseed=1):
    """Draw N points from a mixture of two normal distributions."""
    rand = np.random.RandomState(rseed)
    x = rand.randn(N)
    x[int(f * N):] += 5  # shift the last (1 - f) fraction of points
    return x
x = make_data(1000)
Let's look at this data with a standard count-based histogram:
In [ ]:
hist = plt.hist(x, bins=30)
Or, more properly, we can create a normalized histogram, where the height of the bins reflects probability density rather than raw counts:
In [ ]:
hist = plt.hist(x, bins=30, density=True)
This normalization is chosen so that the total area under the histogram is equal to 1, as we can confirm from the histogram's return values:
In [ ]:
density, bins, _ = hist
widths = bins[1:] - bins[:-1]
(density * widths).sum()
One problem with histograms as density estimators is that the choice of bin size and location can lead to representations with qualitatively different features. To see this, let's look at a smaller dataset of 20 points:
In [ ]:
x = make_data(20)
bins = np.linspace(-5, 10, 10)
Here are two histograms of the same data, using bins that differ only by a small offset:
In [ ]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True,
                       subplot_kw={'xlim':(-4, 9), 'ylim':(-0.02, 0.3)})
fig.subplots_adjust(wspace=0.05)
for i, offset in enumerate([0.0, 0.6]):
    ax[i].hist(x, bins=bins + offset, density=True)
    ax[i].plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
The two visualizations, built from the same data, look entirely different. To see why, we can think of a histogram as a stack of blocks, where one block is stacked within each bin on top of each point in the dataset. Let's view this directly:
In [ ]:
fig, ax = plt.subplots()
bins = np.arange(-3, 8)
ax.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
# stack one unit block per point within its bin
for count, edge in zip(*np.histogram(x, bins)):
    for i in range(count):
        ax.add_patch(plt.Rectangle((edge, i), 1, 1, alpha=0.5))
ax.set_xlim(-4, 8)
ax.set_ylim(-0.2, 8)
The height of the block stack often reflects not the actual density of nearby points, but coincidences of how the bins align with the data. What if, instead, we centered each block on the point it represents and summed the total height at each location along the x-axis?
In [ ]:
x_d = np.linspace(-4, 8, 2000)
# one unit-width "block" centered on each data point
density = sum((abs(xi - x_d) < 0.5) for xi in x)
plt.fill_between(x_d, density, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.2, 8]);
The result reflects the characteristics of the data much more robustly than the standard histogram, though its rough edges come from the hard cutoff of the square blocks. We can smooth these out by replacing each block with a smooth function, such as a Gaussian:
In [ ]:
from scipy.stats import norm
x_d = np.linspace(-4, 8, 1000)
# place a unit-variance normal distribution at each point and sum them
density = sum(norm(xi).pdf(x_d) for xi in x)
plt.fill_between(x_d, density, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.2, 5]);
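These last two plots are examples of kernel density estimation in one dimension: the first uses a "tophat" kernel and the second a Gaussian kernel. Note that as written, the curve sums len(x) unit-area Gaussians, so its total area is the number of points rather than 1. As a minimal sketch (the normalization and numerical check here are additions, not in the original code), dividing by the number of points recovers a proper density:
In [ ]:
# sketch: normalize the summed Gaussians by the number of points
# so that the curve integrates to 1, like a true probability density
density_norm = sum(norm(xi).pdf(x_d) for xi in x) / len(x)
print(np.trapz(density_norm, x_d))  # approximately 1.0
plt.fill_between(x_d, density_norm, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1);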
This algorithm is implemented in the sklearn.neighbors.KernelDensity estimator, which handles KDE in multiple dimensions with one of six kernels. Let's first show a simple example of replicating the above plot using the Scikit-Learn KernelDensity estimator:
In [ ]:
from sklearn.neighbors import KernelDensity
# instantiate and fit the KDE model
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(x[:, None])
# score_samples returns the log of the probability density
logprob = kde.score_samples(x_d[:, None])
plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.ylim(-0.02, 0.22)
The result here is normalized such that the area under the curve is equal to 1.
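We can confirm this numerically; as a quick sketch (this check is an addition to the original), integrating the estimated density over the grid with the trapezoidal rule should return a value very close to 1:
In [ ]:
# sketch: numerically integrate the KDE curve to confirm it has unit area
np.trapz(np.exp(logprob), x_d)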